fix: restore per-cycle span duration for execute_event_loop_cycle#1939
Di-Is wants to merge 3 commits into strands-agents:main
Conversation
PR strands-agents#1293 wrapped event_loop_cycle() in use_span(end_on_exit=True) and removed explicit span.end() calls. Because event_loop_cycle is an async generator, yield keeps the context manager open across recursive cycles, causing all execute_event_loop_cycle spans to share the same OTel end_time. Switch to end_on_exit=False and explicitly call span.end() via _end_span() in end_event_loop_cycle_span() and end_model_invoke_span(), restoring end_span_with_error() in all exception paths.
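The failure mode is easy to reproduce with a toy version of the pattern. The sketch below (illustrative names only, not the SDK's actual code) wraps a recursive async generator's body in a context manager that records its "end time" on exit, mirroring `use_span(end_on_exit=True)`: start times are staggered per cycle, but every "span" ends in the same instant when the generator chain unwinds.

```python
import asyncio
import contextlib
import time

starts, ends = {}, {}

@contextlib.contextmanager
def fake_span(name):
    # Stand-in for OTel use_span(end_on_exit=True): the span "ends"
    # only when the context manager exits.
    starts[name] = time.monotonic()
    try:
        yield
    finally:
        ends[name] = time.monotonic()

async def cycle(depth):
    # Async generator whose body is wrapped in the span context manager,
    # mirroring the shape of event_loop_cycle().
    with fake_span(depth):
        await asyncio.sleep(0.05)  # this cycle's own work
        yield depth
        if depth < 2:              # recurse while still inside the `with`
            async for ev in cycle(depth + 1):
                yield ev

async def main():
    async for _ in cycle(0):
        pass

asyncio.run(main())
start_spread = max(starts.values()) - min(starts.values())
end_spread = max(ends.values()) - min(ends.values())
# Starts are ~50 ms apart, but all ends fire together during unwind.
print(start_spread > end_spread)  # True
```

With `end_on_exit=False` and an explicit `span.end()` at each cycle's logical end, each span's native end time would instead land at the end of its own work.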
fix: handle BaseException in trace spans to prevent span leaks on KeyboardInterrupt. Trace spans were not properly closed when a BaseException (e.g. KeyboardInterrupt, asyncio.CancelledError) was raised. Add explicit BaseException handlers to close spans and aclose() calls to ensure async generators are cleaned up.
Reduce overhead by limiting `force_flush` calls to agent span completion instead of every span end. Add a `flush` parameter to `_end_span()` with default `False`, passing `True` only from `end_agent_span()`.
@Di-Is thanks for this; going to take a look. Would it be possible to have a PR isolated to the span issue, and then spin the rest to a separate PR? It would make it easier to digest.
```diff
             flush: Force the tracer provider to flush after ending the span
         """
-        if not span:
+        if not span or not span.is_recording():
```
Issue: The check `not span.is_recording()` is a good defensive addition, but the docstring doesn't mention this guard condition.
Suggestion: Update the docstring to note that the method is a no-op if the span is not recording:
"""Generic helper method to end a span.
This method is a no-op if span is None or not recording.
Args:
...|
Re: Splitting the PR

I think this PR is appropriately scoped given that commits 1 and 2 are closely related: these are two halves of the same fix; separating them would leave the codebase in an incomplete state between merges. Commit 3 (the force_flush optimization) is indeed independent and could be split out if preferred, though it's a small, low-risk change. The author has structured commits well for independent review: each is reviewable on its own with clear purpose.
Review Summary

Assessment: Comment (with minor suggestions)

This PR correctly addresses a real regression in span timing for observability. The fix is well-reasoned: the async generator's `yield` keeps the `use_span` context manager open across recursive cycles, so spans must be ended explicitly.

Review Details

Strengths
Suggestions (Non-blocking)
Questions for Author

The force_flush optimization in commit 3: have you verified this doesn't impact span export reliability for short-lived agent invocations? The change seems reasonable but curious about your testing.

Overall this is a solid bug fix. Thank you for the detailed analysis and test coverage! 🎉
+1
Correction to my earlier review comment about exception handlers: after re-analyzing the code more carefully, I realize my suggestion to consolidate all the exception handlers was mistaken.

Looking at the three locations:
The design is intentional and correct. Only the first location (lines 166-171) could be simplified if desired, but it's a minor point.
Updated Review Summary

Assessment: Approve ✅

After re-reviewing with the latest changes and analyzing the exception handling patterns more carefully, this PR is well-designed.

Key Points

Exception Handling Design

The separate exception handlers are intentional rather than redundant.
Test Coverage

The tests comprehensively cover:
Minor Note

Only lines 166-171 could potentially be simplified (both handlers do the same thing), but this is a very minor point. The PR correctly fixes the span duration regression while handling all edge cases properly. Ready to merge! 🚀
Execution/run via Langfuse: https://us.cloud.langfuse.com/project/cmct6aysb00eaad07kq618eto/traces/1a462b3368a0e6a5a0b529551df069b4?timestamp=2026-03-31T18%3A06%3A35.626Z, showing the fixed times.
```diff
         )
-        self._end_span(span, attributes, error)
+        self._end_span(span, attributes, error, flush=True)
```
nit: `_end_span` is a shared helper used by all span types (model, tool, cycle, agent). The `flush` param only applies to the agent span, so it feels out of place threading it through this parameter. Could just call `force_flush()` directly in `end_agent_span` after the `_end_span` call and remove the span-level flush in `_end_span`.
i.e.

```python
def end_agent_span(self, span, response=None, error=None):
    ...
    self._end_span(span, attributes, error)
    if self.tracer_provider and hasattr(self.tracer_provider, "force_flush"):
        try:
            self.tracer_provider.force_flush()
        except Exception as e:
            logger.warning("error=<%s> | failed to force flush", e)
```
I think this threading makes sense for introducing minimal changes/disruptions in the codebase, but it does give me some pause on clarity/maintainability.
Looks like
I'm going to force push this to remove my commits and split out to a separate PR containing only the span fix |
Force-pushed ab54df7 to 1b7b6f2
Force-pushed 29eaae5 to 1b7b6f2
Hey @Di-Is, thanks for the fixes; I cherry-picked the duration fix to "fix: restore explicit span.end() to fix span end_time regression" by zastrowm · Pull Request #2032 · strands-agents/sdk-python and got it merged in. Sorry for all the chatter on this PR; to get the fix in sooner rather than later we had to split it out, so I reverted my changes to this PR. If you're willing, the other two commits would be good as standalone PRs.
Description
Since v1.24.0 (PR #1293), `execute_event_loop_cycle` spans no longer reflect per-cycle duration. When a cycle performs tool use and recurses, the parent cycle's native OTel span stays open until all recursive children complete, producing cumulative bottom-up latency instead of per-step latency in observability backends (Langfuse, Jaeger, etc.).

The root cause: `event_loop_cycle()` is an async generator whose body was wrapped in `use_span(end_on_exit=True)`. Because `yield` keeps the context manager open across recursive cycles, all `span.end()` calls fire simultaneously when the generator chain unwinds. The logical metadata (the `gen_ai.event.end_time` attribute) is set at the correct time, but backends use the native OTel `endTimeUnixNano` from `span.end()`.

Three commits, each reviewable independently:

1. fix: restore explicit span.end() to fix span end_time regression. Switches to `end_on_exit=False` and restores explicit `span.end()` calls in `end_event_loop_cycle_span()` and `end_model_invoke_span()`, with `end_span_with_error()` on exception paths.
2. fix: handle BaseException in trace spans to prevent span leaks on KeyboardInterrupt. With `end_on_exit=False`, spans must be explicitly closed on all paths; adds `except BaseException` handlers and `aclose()` for async generators to cover `KeyboardInterrupt` and `asyncio.CancelledError`.
3. perf: only force flush tracer provider when ending agent spans. `_end_span()` previously called `force_flush()` on every span end; this limits it to agent span completion only.

Resolves #1930, #1938
Related Issues

- `execute_event_loop_cycle` spans all share the same OTel end_time #1938 (duplicate, closed)

Documentation PR
N/A
Type of Change
Bug fix
Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

- `hatch run prepare`

Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.